Abstract

Bayes Law, or the law of conditional probability, provides
a natural inference framework for one who views hypothesis
testing as parameter estimation. Heretofore a major difficulty
in applying Bayesian ideas to psychological contexts has
been the specification of an objective or public prior. This
paper proposes a rule for selecting a prior hypothesis which is
both unambiguous and hostile to the research hypothesis: choose
the prior so that 1) its expectation is the conventional null
value and 2) it has maximum probability of producing the observed
data. The rule is employed to develop a complete set of
tests for nominal data, and both a one-sample and a two-sample
test for difference of means. Numerical illustrations are included
for most of the tests developed.

1. Characterization of Hypothesis Testing as Bayesian Inference

When  an  experiment  is  over  or  a  field  study  is  complete, the 
researcher  must  sit  himself  down  and  attempt  to  make  some  sense  out  of 
the  mass  of  data  he  has  collected.   Usually  he  employs  the  data  to  judge 
the  credibility  of  relations  among  constructs.  More  often  he  sifts 
through  the  data  in  the  hope  of  finding  some  relations.   But  in  either 
case  his  inference  task  boils  down  to  finding  an  appropriate  sampling 
model  and  then  to  computing  the  probabilities  of  the  sample  outcomes 
predicted  by  his  hypotheses. 

Up to this point, the steps in the analysis are pretty much the
same both for those who look at the data in "either-or" terms and for
those who feel they can distinguish varying shades of gray. Most
psychologists are in the "either-or" camp: either they can reject the null
hypothesis (support the research hypothesis) or they can't. When a
statistically significant relationship is found, the researcher joyfully
calls a halt to celebrate his good fortune. Very seldom does he try to
estimate the strength of the relation even though some measures do exist.

The  intent  of  this  paper  is  to  propose  a  very  conservative  way  for 
assessing  the  strength  of  a  relation.   A  natural  way  to  approach  this 
problem is to translate the research hypothesis into a statement about some
output parameter. Thus, the stronger the hypothesized relation, the
larger  the  output  parameter.   Since  the  parameter  is  a  random  variable, 
it  must  be  described  in  probabilistic  terms,  i.e.,  in  terms  of  a  mean,  a 
variance,  and  the  probability  that  the  true  value  of  the  parameter  falls 
in a given range.

Let's illustrate these notions with a couple of examples. Suppose
we're studying the relationship between union activity and agreement
with the union's political line (cf. Wilensky, 1956). We dichotomize
each variable and then cast the data into the following contingency table.

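                      High Activity      Low Activity       Totals

Agreement with        $n_{11}$           $n_{12}$           $n_{1\cdot} = n_{11} + n_{12}$
political line        $\hat{n}_{11}$     $\hat{n}_{12}$

Disagreement with     $n_{21}$           $n_{22}$           $n_{2\cdot} = n_{21} + n_{22}$
political line        $\hat{n}_{21}$     $\hat{n}_{22}$

                                                            $n = n_{1\cdot} + n_{2\cdot}$

where $n_{ij}$ and $\hat{n}_{ij}$ are the actual and expected frequencies, respectively.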
Instead of asking, "Is there a significant relationship between union
activity and agreement with political line," we prefer to pose questions
like, "What is the probability that

$$0.2 < (\text{proportion of highly active agreers}) - (\text{proportion of highly active disagreers}) < 0.4,$$

i.e., that the difference in proportions between highly active agreers
and highly active disagreers lies between 0.2 and 0.4?"

Or  suppose  we're  studying  the  relative  effectiveness  of  two  job 
training  programs,  on-the-job  training  versus  vocational  training  in  a 
special  school.   We  select  two  matched  samples,  assign  individuals  randomly 
to  the  two  programs,  and  observe  the  difference  in  annual  income  between 
pair members one year after training begins. As advisors to policy makers,
we're very interested both in the average value of the income difference
and in the probability that the difference is at least $1000.

The law of conditional probability provides a natural framework
for answering these kinds of questions. Denote the output parameter
by $y$ and the sample data by the vector $x = (x_1, x_2, \ldots, x_n)$.
Then the inference problem can be transformed into computing the
probabilistic description of $y$, given $x$. Typically we describe a
random variable in terms of the probability that it lies in a small
interval, say $[y_0, y_0 + dy_0]$. Thus,

$$\Pr[y_0 \le y \le y_0 + dy_0 \mid x] = \frac{\Pr[x \mid y_0 \le y \le y_0 + dy_0]\,\Pr[y_0 \le y \le y_0 + dy_0]}{\Pr[x]}$$   1]

If we define the probability density functions

$$f(y_0 \mid x)\,dy_0 = \Pr[y_0 \le y \le y_0 + dy_0 \mid x], \qquad f(y_0)\,dy_0 = \Pr[y_0 \le y \le y_0 + dy_0], \qquad f(x \mid y_0) = \Pr[x \mid y_0 \le y \le y_0 + dy_0],$$

equation 1] may be rewritten as the continuous form of Bayes Law:

$$f(y_0 \mid x) = \frac{f(x \mid y_0)\, f(y_0)}{\int f(x \mid y)\, f(y)\, dy}$$   2]

The first factor in the numerator, $f(x \mid y_0)$, is merely the
sampling model, which gives the probability of observing the data $x$
if the true value of the output parameter is $y_0$. The second factor
represents our prior feelings about $y$, that is, what we knew about
$y$ before the data were collected. The denominator is the usual
normalizing factor, $\Pr[x]$, expressed as a sum over the set of mutually
exclusive and collectively exhaustive joint events: the data are
$x$ and $y_0 \le y \le y_0 + dy_0$.
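
As a rough numerical illustration of the continuous form of Bayes Law, the following Python sketch evaluates the posterior on a grid. The data (7 successes in 10 Bernoulli trials), the flat prior, and the names `grid_posterior`, `likelihood`, and `prior` are assumptions chosen for the example; they do not come from the paper.

```python
import math

def grid_posterior(likelihood, prior, grid):
    """Discretized Bayes Law: posterior density on an evenly spaced grid."""
    width = grid[1] - grid[0]
    joint = [likelihood(y) * prior(y) for y in grid]
    evidence = sum(joint) * width   # approximates the integral in the denominator
    return [j / evidence for j in joint]

# Hypothetical data: 7 successes in 10 trials; y is the success probability.
r, n = 7, 10
grid = [(i + 0.5) / 1000 for i in range(1000)]   # midpoints of 1000 cells on (0, 1)

def likelihood(y):
    # Sampling model: binomial probability of the data for a given y.
    return math.comb(n, r) * y**r * (1.0 - y)**(n - r)

def prior(y):
    # Flat prior, used here only to keep the illustration simple.
    return 1.0

posterior = grid_posterior(likelihood, prior, grid)
width = grid[1] - grid[0]
posterior_mean = sum(y * f for y, f in zip(grid, posterior)) * width
prob_above_half = sum(f for y, f in zip(grid, posterior) if y > 0.5) * width
print(round(posterior_mean, 4), round(prob_above_half, 4))
```

The two summaries printed at the end correspond to the kinds of statements discussed above: a mean for the parameter and the probability that it falls in a given range.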

The  mathematical  form  of  the  sampling  model  is  objectively 
determined  by  the  way  the  data  are  categorized  and  by  the  relationships 
among  the  variables  in  the  research  hypothesis.   However,  there  is 
presently  no  general  agreement  on  a  procedure  for  selecting  a  prior 
distribution.   We  must  devise  such  a  procedure  if  we  are  to  fulfill 
the  promise  made  above  —  develop  a  conservative  way  to  estimate 
the  strength  of  a  relationship.   Before  proposing  our  own  solution 
to  this  problem,  let's  pause  briefly  to  review  previous  suggestions 
for  choice  of  a  prior. 

2.   Previous  Research  on  Choice  of  a  Prior 

Raiffa and Schlaifer (1961, Ch. 3) note that the choice of a
prior is simple enough when a body of prior data exists. One merely
selects a functional form rich enough to accommodate a wide range of
posterior density functions (e.g., skewed, flat, multimodal, etc.), and
then chooses prior parameters that best fit the data.

When the researcher lacks either data or firm opinions, choice of
a prior is difficult to make and more difficult to defend. For example,
in estimating the parameter in a Bernoulli process, one can characterize
the process by a logarithmic function $\pi$ of $p$ just as well as by $p$ itself.
Clearly a uniform prior on $p$ does not imply a uniform prior on $\pi$.

Raiffa  and  Schlaifer  (1961,  p.  66)  conclude  that 

The  notion  that  a  broad  distribution  is  an  expression  of 
vague opinions is simply untenable . . . Notice, however,
that  although  we  cannot  distinguish  meaningfully 
between  "vague"  and  "definite"  prior  distributions, 
we  can  distinguish  meaningfully  between  prior  distributions 
which  can  be  substantially  modified  by  a  small  number  of 
sample  observations  from  a  particular  data  generating 
process  and  those  which  cannot,  (sic) 


By  sensitivity  analysis,  one  can  define  a  set  of  priors  which  have 
this  "open-minded"  property. 

Edwards, Lindman and Savage (1963) outline a procedure called stable
estimation, for which the assumption of a uniform prior permits a very
good approximation to the actual posterior distribution. Consider an
arbitrary interval $[y_1, y_2]$ for the parameter being estimated.
Stable estimation can be used when

1) $[y_1, y_2]$ is small and highly favored by the data;

2) the actual prior density changes little in $[y_1, y_2]$; and

3) nowhere outside $[y_1, y_2]$ is the actual prior extraordinarily
large compared with its value inside the interval.

If these conditions hold, the posterior is merely the normalized,
conditional sampling distribution, whose properties can be calculated
easily.

Although Edwards et al. briefly discuss prior-posterior analysis
and estimation, their treatment of hypothesis testing follows the classical
decision theoretic notion of using data $D$ to choose between a null
and an alternative hypothesis, denoted by $H_0$ and $H_1$, respectively.
Noting that

$$\Pr[H_0 \mid D] = \Pr[D \mid H_0]\,\Pr[H_0] / \Pr[D],$$

one may state the test in the form of an odds equation:

$$\Omega(H_0, H_1 \mid D) = L(H_0, H_1; D)\,\Omega(H_0, H_1),$$

where

$$\Omega(H_0, H_1 \mid D) = \frac{\Pr[H_0 \mid D]}{\Pr[H_1 \mid D]}, \qquad \Omega(H_0, H_1) = \frac{\Pr[H_0]}{\Pr[H_1]}, \qquad L(H_0, H_1; D) = \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_1]}.$$

In general $H_0$ and $H_1$ can be diffuse. Let $\lambda$ be the parameter of
interest and $f_0(\lambda)$, $f_1(\lambda)$ be the probability density functions
corresponding to $H_0$ and $H_1$, respectively. Then

$$L(H_0, H_1; D) = \frac{\int \Pr[D \mid \lambda]\, f_0(\lambda)\, d\lambda}{\int \Pr[D \mid \lambda]\, f_1(\lambda)\, d\lambda}$$   3a]

Thus, the likelihood ratio, $L(H_0, H_1; D)$, is merely the ratio of the
unconditional sampling distributions corresponding to $H_0$ and $H_1$. In
many cases it is possible to find bounds for the likelihood function
such that any value within the bounded interval completely overwhelms
reasonable choices for the prior odds. Edwards et al. devote the
bulk of their analysis to simplifying 3a] in order to generate these
bounds. They propose no objective way to determine either the density
functions $f_0$, $f_1$ or the prior odds, $\Omega(H_0, H_1)$.

One of the few published attempts to apply this theory is Pitz's
(1968) analysis of two experiments on the perception of rotary motion.
We  shall  compare  his  analysis  with  our  own  in  a  later  section. 

A  very  different  approach  to  the  choice  of  a  prior  involves  the 
concepts  of  entropy  and  testable  information.   Suppose  one  is  estimating 
a  parameter  y.   Testable  information  about  y  constrains  the  form  of  the 
prior.  Jaynes  (1968,  p.  229)  gives  some  examples: 

1) $y < 6$;

2) the mean value of $\tanh^{-1}(1 - y^2)$ in previous measurements
was 1.37;

3) there is at least a 90 percent probability that $y > 10$.

A meaningful prior distribution for $y$ is the one which maximizes
entropy while satisfying the testable constraints. Good (1965, p. 73)
reports that


for  a  two-dimensional  population  contingency  table 
with assigned marginal totals, the principle [of
maximum entropy] leads to the null hypothesis of
no  association.   A  similar  conclusion  applies  to 
an  m-dimensional  table.  .  .  In  general  we  could 
describe  the  effect  of  the  principle  .  .  . 
by  saying  that,  in  some  sense  it  pulls  out  the 
hypothesis  in  which  the  amount  of  independence  is 
as  large  as  possible. 

Jaynes uses the principle to generate several discrete probability
distributions but notes that the continuous case is troublesome because
of the difficulty in defining an invariant measure of the entropy. He
obtains continuous priors by ad hoc invariance arguments, which are based
upon the equivalence of probabilities seen by different observers.
Since continuous priors are the only ones of interest to us, we
cannot use the principle of maximum entropy until the continuous case
is solved by a well defined procedure.

A  third  way  of  defining  a  prior  relies  on  extracting  information 
from  the  sampling  distribution.   In  an  analysis  of  a  normal  process  with 
an  unknown  mean  and  a  normal  prior,  Clutton-Brock  (1965)  chooses  the 
prior parameter in order to maximize the likelihood, i.e., the unconditional
sampling  distribution.   Here,  at  last,  is  something  on  which  to  build. 

Our  procedure  for  determining  a  prior  borrows  ideas  from  both 
Jaynes  and  Clutton-Brock.   In  a  word,  we  suggest  that  a  very  conservative 
prior  can  be  obtained  by  1)  constraining  the  prior  mean  to  lie  at  the 
conventional  null  point,  and  then  2)  selecting  prior  parameters  which 
maximize  the  likelihood. 

Such  a  prior  has  two  interesting  features.   First,  because  we  view 
hypothesis  testing  as  parameter  estimation,   the  argument  of  the  prior 
is a continuous random variable. Thus, we have discarded the "either-or"
point null in favor of a "gray" diffuse null, whose expectation
is the conventional null value of the parameter. Second, by choosing
the prior parameters to maximize the constrained likelihood, we effectively
define  a  null-centered  prior  which  has  maximum  probability  of  generating 
the  observed  sample  data.   For  a  given  functional  form,  the  resulting 
prior  stacks  the  cards  against  the  research  hypothesis  as  much  as  possible. 
Consequently,  estimation  based  on  this  prior  amounts  to  an  a  fortiori 
"test"  of  the  experimental  effect. 

In the next four sections we apply these notions to binomial sampling,
analysis of contingency tables, normal sampling with unknown mean and
variance, and sampling from two different normal populations. For
analytical convenience, conjugate priors are used throughout; however,
the method can be used with any distribution which has a finite mean.
Numerical comparisons with maximum likelihood and conventional tests
are included in the development which follows.

3.   Binomial  Test 

In  the  course  of  binomial  sampling  from  a  population  with  unknown 
Bernoulli  parameter  p,  we  observe  r  successes  in  n  trials,  that  is, 
r  cases  fall  in  one  nominal  category  and  n-r  cases,  in  the  other. 
Moreover, let us suppose that the conventional null value for $p$ is
$p_0$ (0.5 is a very common value for $p_0$).

The conditional sampling distribution is the familiar binomial
expression:

$$P(r \mid n, p) = \binom{n}{r}\, p^{r} (1 - p)^{n - r}$$   4]


The beta function is chosen as the prior on $p$ both because a beta
has a flexible form and because it combines easily with the conditional
sampling distribution. A function with the second property is called
a conjugate prior. Denote the prior on $p$ by the symbol $f'(p)$ and
restrict the prior to be unimodal. Then

$$f'(p \mid r', n') = \frac{p^{r' - 1} (1 - p)^{n' - r' - 1}}{B(r', n' - r')}, \qquad 0 \le p \le 1,$$   5]

where

$$B(a, b) = \int_0^1 t^{a - 1} (1 - t)^{b - 1}\, dt = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$$   6]

and $r'$, $n'$ are the prior parameters, which must be determined.

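The unconditional probability of obtaining the data is merely

$$P(r \mid n, r', n') = \int_0^1 P(r \mid n, p)\, f'(p \mid r', n')\, dp = \binom{n}{r} \frac{B(r + r',\; n + n' - r - r')}{B(r',\; n' - r')},$$   7]

which is a beta-binomial distribution on $r$. Since the prior mean
must fall at the conventional null point, we require that

$$r' / n' = p_0$$   8]

and thereby reduce 7] to an equation in one unknown:

$$P(r \mid n, n', p_0) = \binom{n}{r} \frac{B(r + p_0 n',\; n - r + (1 - p_0) n')}{B(p_0 n',\; (1 - p_0) n')}$$   9]

In order to select that prior which has maximum probability of producing
the data, we

$$\max_{n'}\; P(r \mid n, n', p_0)$$   10]

But $G \equiv \ln P$ will have its maximum at the same $n'$ as $P$.
Thus 10] may be rewritten as

$$\max_{n'}\; G(n')$$   11]

When $n'^{*}$, the maximizing value of $n'$, is an interior point,
its value may be obtained by setting the derivative of $G$ equal to
zero. Thus, $n'^{*}$ must satisfy the equation

$$p_0\,\psi(r + p_0 n') + (1 - p_0)\,\psi(n - r + (1 - p_0) n') - \psi(n + n') - p_0\,\psi(p_0 n') - (1 - p_0)\,\psi((1 - p_0) n') + \psi(n') = 0$$   12]

subject to the constraint

$$r' = p_0 n' \ge 1$$

and where $\psi(z) \equiv \frac{d}{dz} \ln \Gamma(z) = \Gamma'(z) / \Gamma(z)$.

Equation 12] must be solved numerically because both the gamma
function and its derivative are transcendental functions.

By substituting the series definition for $\psi(z)$ into 12],
one can show that there can be only one internal extremum and this
extremum must be a maximum. Thus,

$$n'^{*} = \begin{cases} \text{the solution of 12]} & \text{if } dG/dn' > 0 \text{ at } n' = 1/p_0, \\ 1/p_0 & \text{if } dG/dn' \le 0 \text{ at } n' = 1/p_0. \end{cases}$$   13]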
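
The search for the maximizing prior size can be sketched numerically. The Python fragment below is a minimal illustration, not the procedure used in the paper: instead of solving 12] with the digamma function it maximizes $G$ directly by golden-section search, which relies on the single-maximum property stated above. The function names, the search bracket, and the lower bound $1/\min(p_0, 1 - p_0)$ (which keeps both beta exponents at or above 1) are assumptions of this sketch.

```python
import math

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(n_prior, r, n, p0):
    """G(n') = ln P(r | n, n', p0), dropping the constant ln C(n, r)."""
    a = p0 * n_prior              # r' = p0 * n'
    b = (1.0 - p0) * n_prior      # n' - r'
    return log_beta(r + a, n - r + b) - log_beta(a, b)

def hostile_prior_size(r, n, p0=0.5, upper=None, tol=1e-4):
    """Golden-section search for the n' that maximizes G on [lower, upper]."""
    lower = 1.0 / min(p0, 1.0 - p0)                  # boundary of the unimodal region
    upper = lower + 100.0 * n if upper is None else upper
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = upper - ratio * (upper - lower)
    x2 = lower + ratio * (upper - lower)
    g1 = log_marginal(x1, r, n, p0)
    g2 = log_marginal(x2, r, n, p0)
    while upper - lower > tol:
        if g1 < g2:                                  # maximum lies in [x1, upper]
            lower, x1, g1 = x1, x2, g2
            x2 = lower + ratio * (upper - lower)
            g2 = log_marginal(x2, r, n, p0)
        else:                                        # maximum lies in [lower, x2]
            upper, x2, g2 = x2, x1, g1
            x1 = upper - ratio * (upper - lower)
            g1 = log_marginal(x1, r, n, p0)
    return 0.5 * (lower + upper)

n_prior = hostile_prior_size(r=32, n=50, p0=0.5)
print(round(n_prior, 1), round(0.5 * n_prior, 1))    # n' and r' = p0 * n'
```

When the derivative of $G$ is non-positive at the lower boundary, the search simply converges to that boundary, which mirrors the second branch of 13]. If the data lie very close to the null value, $G$ may keep rising with $n'$ and the search will stop at the upper end of the assumed bracket.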


With the prior parameters now determined, we may solve for $f''(p)$,
the posterior on $p$, and then compute any posterior statistic we choose.
According to Bayes Law,

$$f''(p \mid r'', n'') = \frac{p^{r'' - 1} (1 - p)^{n'' - r'' - 1}}{B(r'', n'' - r'')}$$   14]

where $r'' = r + r'$

$n'' = n + n'$

This  very  convenient  result  flows  from  the  choice  of  a  prior  which 
is  conjugate  to  the  conditional  sampling  distribution. 

For comparison with the classical binomial test, we shall compute
the posterior tail probability, i.e., the probability that $p$ lies on
the improbable side of the null point. This a fortiori significance
level is

$$\alpha_0 \equiv \Pr[p \le p_0] = \int_0^{p_0} f''(p \mid r'', n'')\, dp$$   16]

which is a normalized, incomplete beta function.

The posterior mean and variance, $\bar{p}''$ and $\check{p}''$, respectively,
give a conservative idea of the strength of the effect being studied.

$$\bar{p}'' = r'' / n''$$

$$\check{p}'' = \frac{r'' (n'' - r'')}{n''^{2} (n'' + 1)}$$
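
To make the posterior quantities concrete, here is a small Python sketch that evaluates the posterior mean, variance, and tail probability for the first experiment of Table 1 (r = 32, n = 50, with the rounded prior parameters n' = 17, r' = 8.5). The helper names and the use of composite Simpson's rule for the incomplete beta function are assumptions of this illustration; a library routine for the regularized incomplete beta function would serve equally well.

```python
import math

def beta_pdf(p, a, b):
    """Beta density with exponents a - 1 and b - 1 (this sketch assumes a, b > 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))

def beta_cdf(x, a, b, steps=2000):
    """Normalized incomplete beta function by composite Simpson's rule on [0, x]."""
    h = x / steps                                   # steps must be even
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * beta_pdf(i * h, a, b)
    return total * h / 3.0

def posterior_summary(r, n, r_prior, n_prior, p0):
    """Posterior mean, variance, and tail probability Pr[p <= p0]."""
    r_post = r + r_prior
    n_post = n + n_prior
    mean = r_post / n_post
    variance = r_post * (n_post - r_post) / (n_post**2 * (n_post + 1.0))
    tail = beta_cdf(p0, r_post, n_post - r_post)
    return mean, variance, tail

mean, variance, tail = posterior_summary(r=32, n=50, r_prior=8.5, n_prior=17.0, p0=0.5)
print(round(mean, 3), round(variance, 5), round(tail, 3))
```

The tail probability is taken below the null point because the observed proportion lies above it; for data on the other side of the null one would use the complementary tail.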


Numerical  Examples 

To give the reader an idea of the test's conservatism, let's compare
its results with some calculations made by Edwards, Lindman, and Savage
(1963, p. 225), and with an analysis published by Pitz (1968). Edwards
et al. computed the upper portion of Table 1 in order to demonstrate the
large difference between their likelihood ratios and the p-levels
obtained from a conventional binomial test. The numerator of each
likelihood ratio is based on a point hypothesis located at p = 0.5;
the denominator of $L(p_0; r, n)$ assumes a uniform hypothesis;
and the denominator of $L_{\min}$ uses a point hypothesis placed at $\hat{p} = r/n$,
the maximum likelihood estimate of $p$. The Bayesian p-levels are
computed from equations 12], 14], and 16].
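
The two likelihood ratios just described can be computed directly. The following Python sketch is an illustration under the stated definitions (point null at $p_0$ in the numerator; uniform alternative or point alternative at $r/n$ in the denominator); the function names are hypothetical, and the closed form $1/(n+1)$ for the integral of the binomial probability over a uniform $p$ is a standard identity rather than something quoted from the paper.

```python
import math

def binom_pmf(r, n, p):
    """Binomial probability of r successes in n trials."""
    return math.comb(n, r) * p**r * (1.0 - p)**(n - r)

def lr_uniform_alternative(r, n, p0=0.5):
    """Point null at p0 against a uniform alternative on [0, 1].

    The denominator is the binomial pmf integrated over p, which equals 1 / (n + 1).
    """
    return binom_pmf(r, n, p0) * (n + 1)

def lr_minimum(r, n, p0=0.5):
    """Point null at p0 against a point alternative at the estimate r / n."""
    return binom_pmf(r, n, p0) / binom_pmf(r, n, r / n)

for r, n in [(32, 50), (60, 100), (220, 400)]:
    print(n, r, round(lr_uniform_alternative(r, n), 4), round(lr_minimum(r, n), 4))
```

For the first experiment this gives values close to the tabulated .8178 and .1372; small differences in later columns may reflect rounding or approximation in the published figures.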

TABLE 1

Likelihood Ratios under the Uniform Alternative Prior, Minimum
Likelihood Ratios, and Bayesian P-Levels, for Various Values
of n and for Values of r Just Significant at the .05 Level

Experiment Number          1        2        3        4

n                          50       100      400      10,000     (very large)
r                          32       60       220      5,098      $(n + 1.96\sqrt{n})/2$
$L(p_0; r, n)$             .8178    1.092    2.167    11.689     $.11689\sqrt{n}$
$L_{\min}$                 .1372    .1335    .1349    .1465      .1465
prior $n'$                 17       33       132      ≈3500
prior $r'$                 8.5      16.5     66       ≈1750
$\hat{p} = r/n$            .640     .600     .550     .510
posterior $\bar{p}''$      .605     .575     .538     ≈.507
Bayesian p-level           .042     .041     .041     .042
Conventional p-level       .050     .050     .050     .050
Note the close agreement between the Bayesian and the conventional
p-levels. Because the experimental effect is very large here, the
Bayesian test indicates a stronger result than the conventional binomial
test. However, this apparent lack of conservatism disappears if the
roles of $r/n = \hat{p}$ and the null value of $p$ are interchanged in the
conventional test. This redefinition of $p$ effectively reduces the
standard deviation in the conventional test statistic.

Much more interesting than the p-levels are the values of the
posterior mean, $\bar{p}''$. As we would expect, $\bar{p}'' < \hat{p}$.

Note that for values of $r/n$ very near ½, the experimental effect is
quite small, according to the conservative Bayesian estimate.

Pitz  used  Edwards  et  al.'s  likelihood  ratio  technique  to  discriminate 
between  two  theories  about  the  perception  of  rotary  motion  in  depth.   Day 
and  Power  (1965)  hypothesized  inability  to  identify  the  direction  of 
rotation  while  Hershberger  (1967)  predicted  the  opposite  and  subsequently 
obtained  experimental  evidence  supporting  his  theory.   Thirty-two  of 
Hershberger's  forty-eight  subjects  reported  the  direction  of  rotation 
correctly, yielding an exact one tail p-level of 0.0154 under the null
hypothesis of p = ½ (i.e., the hypothesis that Day and Power's theory is
correct).

Pitz  computed  likelihood  ratios  for  three  different  priors  on  p 

(see  Figure  1).   According  to  Pitz  (p.  254), 

The  assumption  of  a  uniform  prior  implies  that  the 
stimulus  cues  that  form  the  basis  for  Hershberger's 
theory  may  not  be  very  effective  determinants  of  per- 
ception.  A  strong  belief  in  the  effectiveness  of  these 
cues  might  be  represented  by  a  probability  distribution 
. . . such as . . . g2(p) . . . A third distribution,
g3(p), illustrates a still more firm commitment to the
idea that p must be large . . . The value of g3(2/3)
was deliberately chosen to be 0.394, since for this
value [$L(p_0; r, n)$] would be 1.0.*

* To avoid unnecessary confusion, we have substituted "p" for Pitz's
"g" and have used Edwards et al.'s symbol for the stable estimation value
of the likelihood ratio.
Fig. 1. Three possible prior probability
distributions defined for values of Q,
given the correctness of Hershberger's
theory. (It is assumed that values of Q
less than 0.5 are impossible.)


Table 2: Comparison of Pitz's Likelihood Ratio Results with
Bayesian and Conventional Binomial Tests

Data: 32 successes in 48 trials

      $L(p_0; r, n)$ under prior                         Bayesian     Conventional
   g1(p)      g2(p)      g3(p)       $L_{\min}$          p-level      p-level

   0.197      0.296      1.0         0.066               0.0177       0.0154
Table 2 compares Pitz's three values for $L(p_0; r, n)$ with our
results for $L_{\min}$ and the two binomial tests. The Bayesian p-level
is based on a beta prior with parameters n' = 11.0, r' = 5.5, a mean
of 0.5 and a variance of 0.0208. Thus, about 95 % of the prior
null hypothesis is contained in the interval [0.36, 0.64]. Either
a sharper or a more diffuse null would have lower probability of
producing the observed data.

A  puzzling  feature  of  both  Tables  1  and  2  is  the  wide  discrepancy 
between the likelihood ratios and the Bayesian p-levels, and the variation
among  the  likelihood  ratios  themselves.   However,  one  must  remember 
that  a  Bayesian  p-level  and  a  likelihood  ratio  are  two  very  different 
beasts:   the  first  is  the  probability  that  the  parameter  of  interest 
lies  in  a  certain  portion  of  the  tail  of  the  posterior  hypothesis;  whereas, 
the  second  compares  the  probabilities  that  two  rival  hypotheses  produced 
the data. If the rivals are point hypotheses, e.g., $p = \hat{p}$ and $p = p_0$,
and if one hypothesis, say $p_0$, lies so far out on the tail of the
sampling distribution that

then  the  likelihood  ratio  and  Bayesian  p-level  ought  to  be  at  least  the 
same  order  of  magnitude. 

But if one hypothesis is diffuse, then the tactic of stable estimation
permits one to generate virtually any value of $L(H_0, H_1; D)$ he pleases!
As Pitz has demonstrated so well, one merely takes care to sketch an
hypothesis which has the appropriate value at the mode of the sampling
distribution  (see  Figure  1).   Of  course,  it  is  not  feasible  to  pick 
this  modal  value  of  the  hypothesis  before  the  data  are  collected.   But 
worse  yet,  Edwards  et  al.  give  us  no  objective  way  of  picking  it  even 
after  the  data  are  in  hand.   The  test  developed  here  represents  one 
attempt  to  put  the  choice  of  the  prior  on  an  objective  basis. 
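
The dependence of the stable-estimation likelihood ratio on the sketched height of the alternative can be seen in a few lines of Python. This is a hedged illustration: the approximation used (alternative density treated as constant at its value at $r/n$, binomial likelihood integrating to $1/(n+1)$) is the usual stable-estimation shortcut, the function name is hypothetical, and the density value 2.0 for g1 is an inference from the statement that the uniform prior excludes values below 0.5.

```python
import math

def stable_estimation_lr(r, n, p0, alt_density_at_mode):
    """Approximate L(p0; r, n) for an alternative density that is nearly flat near r / n.

    A locally flat alternative density g contributes about g(r / n) / (n + 1)
    to the denominator, so the ratio scales inversely with g(r / n).
    """
    point_null = math.comb(n, r) * p0**r * (1.0 - p0)**(n - r)
    return point_null * (n + 1) / alt_density_at_mode

# Hershberger's data as analyzed by Pitz: 32 successes in 48 trials.
lr_g1 = stable_estimation_lr(32, 48, 0.5, 2.0)     # uniform on [0.5, 1], so density 2 (assumed)
lr_g3 = stable_estimation_lr(32, 48, 0.5, 0.394)   # height 0.394 at the mode, as quoted from Pitz
print(round(lr_g1, 3), round(lr_g3, 3))
```

Changing only the last argument moves the ratio from about 0.2 to about 1.0, which is the sensitivity the paragraph above complains about.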

In  the  next  section  of  the  paper  we  extend  the  binomial  test  to 
the  analysis  of  contingency  tables. 

4.  Analysis  of  Contingency  Tables 

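Consider an $l \times m$ contingency table, with $p_{ij}$ and $p_{ij0}$ being, respectively,
the actual and conventional expected values of the population proportion
corresponding to the $i,j$ segment:

      $p_{11}$      $p_{12}$      $p_{13}$    . . .
      $p_{110}$     $p_{120}$     $p_{130}$

      $p_{21}$      $p_{22}$      $p_{23}$    . . .
      $p_{210}$     $p_{220}$     $p_{230}$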
We feel that the probabilities of interest to the researcher are of
the type

$$\Pr[p_{ij} \le p_{ij0}], \qquad \Pr[a \le p_{ij} - p_{kl} \le b], \qquad \text{etc.}$$   18]

Our goal is to derive expressions 18] for the general m x n table.
However, we will work up to this result gradually by first analyzing
the 1 x m and 2 x 2 cases.

4.1 The 1 x m Table

Suppose an independent random sample yields $r_i$ events in the
$i$th category, there being $n$ events in all. Define $\hat{p}_i = r_i / n$
and $p_{i0}$ as, respectively, the actual and expected proportion of
events of type $i$. Thus, the contingency table for the data looks like
this:


Category                 1              2              3            . . .    m              Totals

Actual Frequency         $r_1$          $r_2$          $r_3$        . . .    $r_m$          $n$
Actual Proportion        $\hat{p}_1$    $\hat{p}_2$    $\hat{p}_3$  . . .    $\hat{p}_m$    1

Expected Proportion      $p_{10}$       $p_{20}$       $p_{30}$     . . .    $p_{m0}$       1

The conditional sampling distribution is the multinomial:

$$P(r \mid n, p) = \frac{n!}{\prod_{i=1}^{m} r_i!} \prod_{i=1}^{m} p_i^{r_i}$$   19]

where

$$\sum_{i=1}^{m} p_i = 1, \qquad \sum_{i=1}^{m} r_i = n.$$   20]

As for the binomial case, the maximum likelihood estimate for $p_i$ is
merely $r_i / n$. This is easily shown by substituting 20] into 19]
and then differentiating the log of the resulting expression.
Using the binomial development as a guide, we select as prior the
multivariate beta density with means equal to the conventionally
defined expected proportions.


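$$f'(p \mid r') = \frac{\Gamma(n')}{\prod_{i=1}^{m} \Gamma(r_i')} \prod_{i=1}^{m} p_i^{r_i' - 1}$$   21]

where

$$0 \le p_i \le 1, \qquad \sum_{i=1}^{m} p_i = 1, \qquad r_i' = n' p_{i0} \ge 1$$   22a,b,c]

The unconditional sampling distribution, the probability of obtaining the
observed data, is obtained as before.

$$P(r \mid n, r') = \frac{n!}{\prod_{i=1}^{m} r_i!}\, \frac{\Gamma(n')}{\Gamma(n + n')} \prod_{i=1}^{m} \frac{\Gamma(r_i + r_i')}{\Gamma(r_i')}$$   25]

Upon substituting 22c] into 25], taking the natural log, and
differentiating with respect to $n'$, we obtain the equation which determines
the internal maximizing value of $n'$.

$$\psi(n') - \psi(n + n') + \sum_{i=1}^{m} p_{i0} \left[ \psi(r_i + n' p_{i0}) - \psi(n' p_{i0}) \right] = 0$$   26]

The global max $n'^{*}$ is either $m$ or the solution to 26], whichever
yields the larger value of $P$. As in the 1 x 2 case, 26]
is a transcendental equation and must be solved numerically.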

Because the multivariate beta prior $f'(p \mid r')$ is conjugate to
the conditional sampling distribution, the posterior $f''(p \mid r'')$ is
also multivariate beta. In particular,

$$f''(p \mid r'') = \frac{\Gamma(n'')}{\prod_{i=1}^{m} \Gamma(r_i'')} \prod_{i=1}^{m} p_i^{r_i'' - 1}$$   27]

where

$$0 \le p_i \le 1, \qquad r_i'' = r_i + r_i', \qquad n'' = n + n'$$   28a,b,c]
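The development of the 1 x m test is completed with a calculation
of two event probabilities: 1) $\alpha_i \equiv \Pr[p_i \le p_{i0}]$,
the probability, say, that the $i$th proportion actually lies on the
indicated side of the null point; and 2) $\alpha_{ij}(\Delta) \equiv \Pr[p_i - p_j \le \Delta]$,
the probability that two proportions differ by at least $\Delta$.
Clearly

$$\alpha_i = \int_0^{p_{i0}} dp_i \int \cdots \int f''(p \mid r'') \prod_{k \ne i} dp_k$$   29]

with the inner integrals extending over the region $\sum_{k \ne i} p_k = 1 - p_i$.

Although formidable appearing, 29] is easily integrated to
yield

$$\alpha_i = I_{p_{i0}}(r_i'',\; n'' - r_i'')$$   30]

where $I_x(a, b)$ is the incomplete beta function:

$$I_x(a, b) = \frac{1}{B(a, b)} \int_0^x t^{a - 1} (1 - t)^{b - 1}\, dt$$

In order to calculate $\alpha_{ij}(\Delta)$ we begin with the marginal
on $p_i$ and $p_j$. With no loss of generality, we can assign $i$ and $j$
the values 1 and 2. The calculation of $\alpha_i$ easily leads us to
discover that

$$f''(p_1, p_2) = \frac{\Gamma(n'')}{\Gamma(r_1'')\,\Gamma(r_2'')\,\Gamma(r_3'')}\; p_1^{r_1'' - 1}\, p_2^{r_2'' - 1}\, (1 - p_1 - p_2)^{r_3'' - 1}$$   31]

with

$$r_3'' = n'' - r_1'' - r_2''$$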
Now define

$$z \equiv p_1 - p_2$$   32]

and consider the event $z \le \Delta$. We must consider two cases,
$\Delta \le 0$ and $\Delta > 0$. The event $z \le \Delta$ for a negative $\Delta$
is portrayed in Figure 2. Let's analyze this situation first.

Figure 2: Region of Integration Used in Computing the Cumulative
Probability, $\Pr[z \le \Delta]$, $\Delta \le 0$

Clearly, 


^-l^^«^^^  \\<i^«=»P^^<^^^P,*<-Mr».vv^ 


^ 


A. 
where


and 


A  -=  \^'dp,  \\p^  ^%;'->  ^-^-p/-  iH,.,„-i 


After  the  transformation 


•)^  -  ?,  /^^-^*^ 


the  integral   ^    becomes 


o       'o 

1 
with 


351 


^3  =   ^?^-^^  /O-?^^ 

If  r  ,  r-,  r_  are  integers,  one  may  use  formula  ^IG.la'iJ  in  Grobner  and 
Hofreiter  (1961)  to  obtain 


where 


C  VK',  d•,•l-'^  -  Vrv  C.NVL^.a'^^VW+^.cl') U^'^^'-'V-^^cil 


Finally, the integral can be expressed as a sum of hypergeometric functions
by applying formula [3.211] from Gradshtein and Ryzhik (1965, p. 287)


-w=o 


3-t] 
where

Combining 34] and 37], we obtain $\alpha_{12}(\Delta)$ for the case of $\Delta \le 0$.
For the case $\Delta > 0$, one can now show quite easily that

with  ^5  H  i  u" 

Let us summarize the above calculation of $\alpha_{12}(\Delta)$:


where 


andij),  ,  is  defined  by  equations  37]  and  36bJ  . 

Example 

A trichotomy with 15 observations provides a simple numerical
application of the 1 x m test:


o^Ja.b"] 
Category                 1        2        3        Totals

Actual Frequency         1        4        10       15
Expected Frequency       5        5        5        15

Because the formulae for $\alpha_{ij}(\Delta)$ are rather complicated, we shall
limit the analysis to calculation of the $\alpha_i$.

For these data, the global max of the unconditional sampling
distribution occurs at the boundary point n' = 3. The posterior
parameters and individual "tail probabilities" can be immediately computed
from 22] and 30]. Table 3 compares these Bayesian results with a
conventional $\chi^2$ analysis.


                           Bayesian Analysis                  Conventional Analysis

Category                   1          2          3

Prior Parameter            1          1          1            P = 0.0075
Posterior Parameter        2          5          11
Tail Probability           0.990      0.719      0.0074

Table 3: Comparison of Bayesian and Conventional Analysis of a
1 x 3 Contingency Table


Note  that  the  conventional  test  tells  one  only  that  the  data,  taken  as 
a  whole,  constitute  a  compound  rare  event.   In  contrast,  the  Bayesian 
analogue  looks  at  each  cell  and  provides  1)  a  tail  probability  which  is 
maximally  prejudiced  against  the  data,  and  2)  a  conservative  mean  value 
for  each  proportion. 
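
The posterior parameters and tail probabilities of the trichotomy example can be checked with a short Python sketch. It assumes that each tail probability is $\Pr[p_i \le p_{i0}]$ evaluated from the marginal beta posterior with parameters $(r_i'', n'' - r_i'')$, and it takes n' = 3 from the text rather than recomputing it. The function name is hypothetical, and the binomial-sum identity it uses holds only because the parameters here are integers.

```python
import math

def incomplete_beta_int(x, a, b):
    """I_x(a, b) for positive integers a, b, using the identity
    I_x(a, b) = Pr[Binomial(a + b - 1, x) >= a]."""
    trials = a + b - 1
    return sum(math.comb(trials, k) * x**k * (1.0 - x)**(trials - k)
               for k in range(a, trials + 1))

actual = [1, 4, 10]
expected = [5, 5, 5]
n = sum(actual)
n_prior = 3                                   # boundary value reported in the text

prior = [n_prior * e // n for e in expected]  # r_i' = n' * p_i0, here 1 for each cell
posterior = [r + rp for r, rp in zip(actual, prior)]
n_post = n + n_prior

tails = [incomplete_beta_int(e / n, rpp, n_post - rpp)
         for e, rpp in zip(expected, posterior)]
print(posterior, [round(t, 4) for t in tails])
```

Under these assumptions the first two tail probabilities agree with the tabulated 0.990 and 0.719 to three decimals, while the third comes out near 0.008, so the exact form of the tail integral used for that cell may differ slightly from the reading adopted here.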
Of course, we could compute a global statistic comparable to the
conventional χ². It would be merely the probability of the
compound event,


I  ^  ^  ^,0  ^ 


r-   -  ? 


^.    , 


where  the  inequalities  are  chosen  to  represent  the  case  exemplified  by  the 
observed  data.   But  why  do  it?  We  are  really  interested  in  knowing 
how  much  confidence  to  place  in  the  detailed  structure  of  the  data. 
The Bayesian test gives conservative answers to queries about fine
structure, whereas the conventional approach affords such answers (although
they are rarely requested) only by the clumsy technique of binomial tests.


4.2  The  2  X  2  Table 

The analysis of a 2 x 2 contingency table is the next step toward
the treatment of the ℓ x m case. Suppose we have drawn independent
random  samples  from  populations  I  and  II,  and  then  categorized  each 
observation  as  "A"  or  "not-A".   Such  data  are  usually  presented  in  a 
2x2  table: 

                         Attribute

Population         A            not-A             Totals

    I             r_1         n_1 - r_1            n_1
    II            r_2         n_2 - r_2            n_2

  Totals       r_1 + r_2    n - r_1 - r_2           n

where r_1, r_2 = observed frequencies of type A in populations I and
II, respectively;
n_1, n_2 = size of sample from populations I and II, respectively.


Suppose p_1 and p_2 are the true proportions of type A's in the
two populations. Then the conditional sampling distribution is the
product of two binomials.
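For reference, the product of two binomials can be written out explicitly. The display below is a sketch in our own notation (the symbol P on the left is an assumption, since the paper's symbol did not survive); it uses only the quantities r_i, n_i and p_i defined above.

```latex
\[
P(r_1, r_2 \mid n_1, n_2, p_1, p_2)
  = \prod_{i=1}^{2} \binom{n_i}{r_i}\, p_i^{\,r_i}\, (1 - p_i)^{\,n_i - r_i},
  \qquad 0 \le p_i \le 1 .
\]
```

Because the two samples are independent, each factor can be treated exactly as in the one-population binomial test.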

A  convenient  conjugate  prior  is  the  matrix  beta  : 

where   .       •  H-\3 

0  ^  p  ^  1 

And  the  unconditional  sampling  distribution   ii     is  computed  in  the 
usual  way. 

P  =  t   cip  p  (Wf)^    ir.)f^(^-pj   )dpp   (»-p;i    UJp,^\-?^^ 

where 

nl'.  =  vx',  -v-  v\. 

In  order  to  determine  internal  extrema  of  43]  ,  we  set  d  I.  iUv  ir  J  -^ 
and  so  obtain  the  two  independent  conditions. 
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where   _      ,   ,      n; 

in  accord  with  the  rule  of  equating  prior  means  with  conventional 
null  values. 

Equations 45] have precisely the same form as 12], the
condition  determining  the  internal  max  in  the  binomial  test.   Thus, 
we  immediately  conclude  that  the  global  max  for   it    occurs  at  the 
point   (  ix'^  ,  Y\\^  ,       where 


So  iu.'flOir\.   XO   -tS^   otVxar-'vsji.S'Z 

Because the matrix beta prior is conjugate to the conditional sampling
distribution, the posterior is also matrix beta, that is,


with  posterior  parameters  defined  by  equations  44]  above. 
The statistic of interest to the researcher is

\[ \Delta = p_1 - p_2 , \]

the difference in the proportions of A's in the two populations.

It is a straightforward matter to compute f(Δ), the probability
density, and P(Δ), the cumulative distribution, for this
difference if r_i', n_i' are restricted to integral values;


UC  B(r.r^^-^,^  (  Zl\'"  C\-£^^ ''  "^"^  F  [r;  ^  r.-u,^  \ ,  \-r^  ,  fv^-r^vrT  :  i:_^  .,  i-a"] 


K  l^  ,r,-u,s-\,  \-r,  .Tv-L-t^vq  :  1 


r — I   ('*\,-t7-l^-i  -I 


\ 


^        x-t. 


where 


/ 
To avoid needless clutter, primes were omitted in 48] - 50].

The expression for the cumulative is particularly simple when
Δ = 0. Moreover, for this special case, it is possible to get
a closed form expression for non-integral values of r_i and n_i (we
are still omitting all primes):


S'] 


This looks like a mess, but it really isn't so bad because it
will  always  be  possible  to  choose  n„  and  r„  such  that   \  +  r  -u  ^O 
thereby  causing  the  series  to  terminate  after   '^t-'f\-l      terms. 
We  attempted  to  find  a  simpler  expression  for    .F       but  had  no 
success. 

If n_i, r_i are integral, 51] can be simplified somewhat:


It 


where 


and r_i, n_i are integers.
This completes the development of the 2 x 2 test. Before
extending the analysis to an m x n table, let's pause to apply the
2 x 2 results to a numerical example.

Example

Suppose  a  study  of  attribute  A  in  two  populations  yields  the 
following  data: 


                      Attribute

Population        A           not-A        Totals

    I          15 (10)       5 (10)          20
    II          5 (10)      15 (10)          20

  Totals         20            20            40


Table 4 compares a conventional analysis with a Bayesian analysis
in which prior parameters are restricted to integral values. Calculation
of other posterior statistics, for example, the mean difference in
proportions, is omitted; however, the requisite expressions may be
found in 48] - 50] above.


Bayesian Analysis                               Conventional Analysis

Prior parameters:       r' = 2,    n' = 4       χ² (1 d.f.) = 10
Posterior parameters:   r'' = 17,  n'' = 24     P = 0.00078

P[p_1 - p_2 < 0] = 0.00138


Table 4:   Comparison of Bayesian and Conventional Analysis of a
           2 x 2 Contingency Table
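When the posterior parameters are integers, the probability that one proportion falls below the other can be evaluated as a finite sum of beta functions. The sketch below is not the paper's expression 48] - 50]; it uses a well-known closed form for two independent beta densities, and the function names and the illustrative parameter values (beta parameters 17, 7 and 7, 17) are our assumptions.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_first_below_second(a1, b1, a2, b2):
    """P(p1 < p2) for independent p1 ~ Beta(a1, b1), p2 ~ Beta(a2, b2).

    Finite-sum closed form; a2 must be a positive integer.
    """
    total = 0.0
    for k in range(a2):
        total += exp(
            log_beta(a1 + k, b1 + b2)
            - log(b2 + k)
            - log_beta(1 + k, b2)
            - log_beta(a1, b1)
        )
    return total

# Illustrative (assumed) posteriors: population I has mean 17/24,
# population II has mean 7/24.
p_negative_difference = prob_first_below_second(17, 7, 7, 17)
print(f"P(p_1 - p_2 < 0) = {p_negative_difference:.5f}")
```

Working with log-gamma values avoids overflow when the sample sizes are large.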


4.3  The ℓ x m Table

The results of sections 4.1 and 4.2 are easily extended to
an ℓ x m test. The conditional sampling distribution is a product
of ℓ multinomials, each having m-1 independent proportions. But
more important, the prior-posterior family is an ℓ x (m-1) matrix
beta of the form


M-Q 


it     it 


P 


'4.W. 


TV  { rcn^^  TT 


^v, 


^r' 


r-  ^'^^? 


5  51 


where 


1  -r.?. 


As  in  the  1  x  m  case,  we  are  interested  in  the  probabilities 

^^  ^  P.-  ^  P..  '^  . 
p    being  the  null  or  conventional  expected  value.   But  it  is  easy 

to  see  that  the  expression  for  54]  is  the  same  as  the  corresponding 

quantity  in  the  1  x  m  analysis.   We  need  merely  insert  a  subscript 

in  the  1  x  m  result.   Moreover,  the  probabilities 


SHI 


'5 

have  the  same  form  as  the  1  x  m  results,  and  the  probabilities 


^^^^.r^y.'-^\ 


5  5  a] 


SSwl 
have the same form as the expression just derived for the 2 x 2
case. Thus, we now have in hand a complete Bayesian counterpart of the
classical analysis of nominal data. In the two sections which follow,
we use these same notions of conservative estimation to develop a
one-sample and a two-sample test for normal populations.


5.   One  Sample,  Difference  of  Means  Test 

Suppose one has obtained n samples (x_1, x_2, ..., x_n) from a
normal population with unknown mean and variance, μ and σ²,
respectively. The conditional sampling distribution is,


\[ f(x_1, \ldots, x_n \mid \mu, h) = \left(\frac{h}{2\pi}\right)^{n/2} \exp\left[-\frac{h}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right] \qquad 56] \]

where we have used the precision h = 1/σ² instead of the variance.
By differentiating the logarithm of 56], it is easy to show that the
maximum likelihood estimators of μ and h are merely the sample
mean and sample precision, that is,


IX 


^ 


'  i:>. 


^  Z_  A;    ■=  ^-/v 


r\. 


Since m and v involve all of the information necessary to estimate
the true mean and precision, we choose the sample mean and
precision as the sufficient statistics which must appear in the prior
on μ and h.


fe^  e..^-] 
Raiffa and Schlaifer (1961, pp. 298-303) note that a normal-gamma
prior is conjugate to 56] when both the mean and variance
are unknown. Thus, if we choose a prior of the form,

the posterior is also normal-gamma, with parameters m'', n'', v'', ν''
which are simple functions of the prior parameters and the data:

\[ n'' = n + n' , \qquad m'' = \frac{n m + n' m'}{n + n'} \]

\[ \nu = n - 1 ; \qquad \nu' = n' - 1 ; \qquad \nu'' = \nu + \nu' + 1 \]

\[ v'' = \left[ \nu' v' + n' m'^2 + \nu v + n m^2 - n'' m''^2 \right] / \nu'' \qquad 59] \]
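The normal-gamma update is easy to mechanize. The following is a minimal sketch that follows the standard Raiffa and Schlaifer parametrization (m, n, v, ν); it assumes that the sample variance v is computed with divisor ν = n - 1 and that both n and n' are positive. The function name and the illustrative numbers are ours.

```python
def normal_gamma_update(m_prior, n_prior, v_prior, nu_prior, xs):
    """Return posterior (m'', n'', v'', nu'') of a normal-gamma prior.

    Assumes len(xs) >= 2 and n_prior > 0.
    """
    n = len(xs)
    m = sum(xs) / n                              # sample mean
    nu = n - 1                                   # sample degrees of freedom
    v = sum((x - m) ** 2 for x in xs) / nu       # sample variance (assumed divisor nu)

    n_post = n + n_prior
    m_post = (n * m + n_prior * m_prior) / n_post
    nu_post = nu + nu_prior + 1
    v_post = (
        nu_prior * v_prior + n_prior * m_prior ** 2
        + nu * v + n * m ** 2
        - n_post * m_post ** 2
    ) / nu_post
    return m_post, n_post, v_post, nu_post

# Illustrative data only: prior centered on the null value m' = 0 with n' = 1.
m2, n2, v2, nu2 = normal_gamma_update(0.0, 1, 1.0, 0, [1.0, 2.0, 3.0, 4.0])
print(m2, n2, v2, nu2)
```

Note how the posterior mean is pulled from the sample mean toward the prior mean in proportion to n'/(n + n'), which is the source of the conservatism discussed below.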


To  apply  our  method,  we  begin  with  the  unconditional  sampling 
distribution; 


'  J 


OO       oo 


\j,.  \  av  «^i^,.,.,.-.^  \.^^■■^■-^(sr^-P\-'^E(n-^fi 


-»>o       o 


r('i')  (^-""^"^^■'^' 


^^1 


5^1 


J 
Now we choose that member of the normal-gamma family which
maximizes 60] subject to the constraints

\[ n' \ge 1 \qquad 61a] \]

\[ m' = 0 \qquad 61b] \]

The first constraint is imposed by the choice of prior, whereas
the second corresponds to the conventional null hypothesis of
"no mean difference." In view of 61b], the expressions for the
posterior mean and variance become

\[ m'' = \frac{n m}{n + n'} \qquad 62a] \]

\[ v'' = \left[ \nu' v' + \nu v + \frac{n n'}{n + n'} m^2 \right] / \nu'' \qquad 62b] \]

Upon  equating  to  zero  the  two  partial  derivatives  of   S:    ,  we 
obtain  the  conditions  for  an  unconstrained  internal  extremum: 


f-  %.f  -o  --  ^.  -  ^:  i^"'  <^^^i 


d  Y\ 


Equation 63a] yields the very interesting result

\[ v'' = v' \qquad 64] \]

In other words, the prior parameters are chosen in such a way that the
posterior variance does not decrease from its prior value. Moreover,
when 64] is combined with 62b] to yield,

■A,  YV" 

we note that the prior variance generally exceeds the sample variance.
Thus, this particular choice of prior parameters will usually not
add to the precision of our estimates. Apparently, this result
depends upon the functional form of the prior; a gamma-1 prior
yields a much more complex relation between v' and v''.

After using 62b] and 64] to compute ∂v'/∂n', we may combine
63b] and 65] to obtain a single condition for constrained internal
extrema:


where  V'  is  defined  by  65]  above.   Thus,  the  maximizing  value  of 
n'  is  defined  by  the  expression. 


(£<£,3 


\[ n' = \begin{cases} 1 & \text{if } \partial \ln P / \partial n' \le 0 \text{ at } n' = 1 \\ \text{the solution to 66]} & \text{otherwise} \end{cases} \qquad 67] \]

The one sample test is completed by calculation of certain
posterior statistics of interest. An examination of 58] reveals
that the marginal distribution on h is gamma-2 with parameters
v'' and ν''. Whence it is easy to calculate the posterior mean and
variance of h:

\[ \bar{h}'' = 1 / v'' \]

\[ \check{h}'' = 2 / (\nu'' v''^2) \qquad 68a,b] \]
After a little manipulation, one can show that the marginal
on μ is a general Student density function, i.e.,

\[ f(\mu \mid m'', n'', v'', \nu'') = f_S(\mu \mid m'', n''/v'', \nu'') \]

whose mean and variance are, respectively,

\[ \bar{\mu}'' = m'' , \qquad \nu'' > 1 \]

\[ \check{\mu}'' = \frac{v''}{n''} \cdot \frac{\nu''}{\nu'' - 2} , \qquad \nu'' > 2 \]

Finally, the Bayesian tail probability for μ is merely


where A(t|ν) is the tabulated Student integral for ν degrees of
freedom:


t 
-t 


1  /  1      -s( 


^B(i-n  u 


09l 


VOO.\o"^ 


(i-^)  'dx  ^1] 


Table 5:   Comparison of Bayesian and Conventional Difference of
           Means Test:   t-statistic for m = 1 and various values for
           n and v.


                                   t-statistic

            v = 0.25           v = 1.0            v = 4.0            v = 16.0
   n     Bayes.   Conv.     Bayes.   Conv.     Bayes.   Conv.     Bayes.   Conv.

  10      4.61     6.32      2.45     3.16      .421     1.58      0        .790
  20      7.44     8.94      4.01     4.47      1.59     2.24      0       1.118
  30      9.62    10.95      5.08     5.48      2.24     2.74      .121    1.37
  40     11.43    12.65      5.97     6.32      2.76     3.16      .641    1.58
  50     13.02    14.14      6.75     7.07      3.16     3.54      1.02    1.77
  60     14.45    15.49      7.44     7.75      3.55     3.87      1.28    1.94
  70     15.75    16.73      8.09     8.37      3.88     4.18      1.51    2.09
  80     16.96    17.89      8.68     8.94      4.18     4.47      1.70    2.24
  90     18.09    18.97      9.24     9.49      4.47     4.74      1.88    2.37
 100     19.16    20.00      9.76    10.0       4.74     5.00      2.04    2.50

                              One Tail Probability:

            v = 0.25           v = 1.0            v = 4.0            v = 16.0
   n     Bayes.   Conv.     Bayes.   Conv.     Bayes.   Conv.     Bayes.   Conv.

  10     .00037   .00006    .01448   .00575    .33735   .07415    .5       .22476
  20     .00000   .00000    .00029   .00013    .05993   .01877    .5       .13874
  30     .00000   .00000    .00000   .00000    .01519   .00521    .45179   .09070
  40     .00000   .00000    .00000   .00000    .00402   .00151    .26108   .06096
  50     .00000   .00000    .00000   .00000    .00123   .00045    .15549   .04166
  60     .00000   .00000    .00000   .00000    .00034   .00013    .10062   .02880
  70     .00000   .00000    .00000   .00000    .00010   .00004    .06724   .02007
  80     .00000   .00000    .00000   .00000    .00003   .00001    .04537   .01408
  90     .00000   .00000    .00000   .00000    .00001   .00000    .03128   .00993
 100     .00000   .00000    .00000   .00000    .00000   .00000    .02170   .00703

Table 6:   Comparison of Bayesian and Conventional Difference
           of Means Test:   One-Tail p-levels for m = 1 and various values
           for n and v.


Part  I:   Bayesian  Difference  of  Means  Test 


LEX 

LEX 

LEX 

LEX 

LEX 

GDP 

WTS- 

WTS- 

WTS- 

WTS- 

WTS- 

WTS- 

WTS- 

RANKS- 

y   RANKS- 

•  EGO- 

LEX 

LEX 

LEX 

GDP 

LEX 

LEX 

GDP 

LEX 

GDP 

LEX 

WTS 

RANKS 

EGO 

EGO 

RANKS 

EGO 

EGO 

EGO 

EGO 

EGO 

Prior  n 

oo 

9 

38 

11 

11 

■  "-.) 

1005 

8 

14 

91 

ndf 

29 

58 

31 

31 

1025 

28 

34 

111 

A 

.0103 

.1073 

.0292 

.0279 

.0868 

.0495 

.0494 

.0330 

.0285 

.0060 

(m)„* 

.0301 

.2027 

.0685 

.0962 

.1656 

.0430 

.0706 

-.1161 

-.0879 

-.0275 

Posterior  m 

.1795 

.0311 

.0803 

.1378 

.0018 

-.1068 

-.0671 

-.0066 

t 

2.296 

1.067 

2.069 

2.025 

.2031 

2.416 

1.796 

.6844 

P  one-tail 

.5 

.0145 

.1453 

.0235 

.0258 

.5 

.4196 

.0112 

.0407 

.2476 

Part II:   Conventional Difference of Means Test

ndf 

on   - 

on 

iK) 

t 

1.361 

2.835 

1.837 

2.637 

2.575 

.8858 

1.455 

2.930 

2.388 

1.623 

P  one-tail 

.0943 

.0051 

.0406 

.0079 

.0090 

.1931 

..  . 

.0806 

.0041 

.0135 

.0601 

*  The  subscript  "tr"  denotes  the  value  of  the  sample  variance  and  mean 
after  each  observation  x.  was  transformed  to  the  infinite  interval 
according  to  the  expression  (x.)   =  tan(IIx.) 
Examples

Two sets of results are presented as illustration of the
Bayesian, one sample "t-test" developed above. The first set of
results (Table 5) merely compares the Bayesian and conventional
t-test for a sample mean of 1.0 and a sample standard deviation
ranging from 0.5 to 4.0. Note that the Bayesian p-levels are always
conservative relative to the conventional values. Indeed, for large
variance and small n the Bayesian p-level is roughly three to four
times the conventional figure.
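The conventional columns of the comparison can be reproduced with a few lines of code. This sketch assumes the conventional statistic is t = m / sqrt(v/n) with n - 1 degrees of freedom, and it evaluates the one-tail Student probability by Simpson's rule rather than from tables; the function names are ours. The Bayesian columns are not reproduced here because they require the maximizing prior parameter n'.

```python
from math import exp, lgamma, log, pi, sqrt

def conventional_t(m, v, n):
    """Conventional one-sample t statistic for sample mean m and variance v."""
    return m / sqrt(v / n)

def student_density(x, nu):
    """Density of the standard Student distribution with nu degrees of freedom."""
    log_c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return exp(log_c - (nu + 1) / 2 * log(1 + x * x / nu))

def one_tail_p(t, nu, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    s = student_density(0.0, nu) + student_density(t, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * student_density(i * h, nu)
    return 0.5 - s * h / 3

# Sample mean 1.0, sample variance 1.0, n = 10 (one cell of the comparison).
t_conv = conventional_t(1.0, 1.0, 10)
p_conv = one_tail_p(t_conv, 10 - 1)
print(f"t = {t_conv:.2f}, one-tail p = {p_conv:.5f}")
```

For this cell the sketch gives t close to 3.16 and a one-tail probability close to the tabulated .00575.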

However,  the  real  power  of  the  Bayesian  approach  is  shown  in 
Table  6,  where  the  posterior  mean  gives  a  conservative  estimate  of 
the  effect  under  study.   The  data  for  this  example  come  from  a  study 
of  two  varieties  of  models  used  to  describe  individual  choice  behavior 
(Lavin,  1969).   The  first  type,  called  the  weights  model,  is  denoted 
by  WTS  in  Table  6.   In  this  model,  the  decision  maker  is  supposed  to 
weight  all  the  attributes,  score  each  alternative  on  each  attribute, 
and  then  base  his  choice  on  the  sum  of  weighted  attribute  scores. 

In the second type of model the decision maker is required to do
much less computation. He merely ranks the attributes and then ranks
alternatives on the most important attribute. Lower order attributes
are used only to break ties. Table 6 uses the symbol LEX to characterize
the lexicographic structure of this model. The second element of a
LEX model's name refers to the source of attribute ranks: WTS
and RANKS denote sets of ranks which are based on direct queries about
attribute weights and ranks, respectively; EGO denotes an indirect
measure of importance using Sherif's (1965) notion of ego-involvement.

Results are also reported for a variety of the lexicographic
model which uses aspiration level data to dichotomize attributes
into selection criteria and constraints. The symbol GDP stands
for Generalized Decision Processor, the name given to the non-routine
decision program in which this type of lexicographic choice routine
appears (Soelberg, 1967).

The  comparisons  made  in  Table  6  are  based  on  the  differences 
in  rank  order  correlations  between  predicted  and  actual  alternative 
ranks.   Because  the  coefficient  of  correlation  ranges  between 
-1  and  +1,  the  correlation  differences  must  first  be  transformed 
to  an  infinite  interval  in  order  to  be  consistent  with  the  assumed 
normality  of  the  underlying  distribution. 

An  examination  of  Table  6  reveals  some  interesting  differences 
between  the  two  tests.   When  there  is  an  unmistakable  difference, 
as  in  the  WTS-LEX  RANKS  comparison,  the  Bayesian  test  produces  a 
p-level  which  is  pleasingly  small,  though  nonetheless  three  times 
larger  than  the  conventional  value.   But  more  important,  the  Bayesian 
analysis  states  that  0.18  is  a  conservative  estimate  of  the  difference 
in  model  accuracy,  a  difference  of  considerable  "practical  significance." 

In a similar vein, we note that the WTS model is significantly
more accurate than the LEX EGO model at the 0.05 level of the
conventional test; however, this difference is much less significant when
viewed with Bayesian eyes. Indeed, a posterior estimate of the mean
difference in accuracy is only 0.03, which is hardly impressive by any
standards.

In the next section we develop the last test in the paper by
extending the results of section 5 to the case of two independent samples.

6.   Two Sample Difference of Means Test

The usual difference of means problem requires one to infer
the mean difference between two populations when one knows neither
the mean nor the variance in either population. Suppose one has
two independent random samples (x_11, x_12, ..., x_1n_1) and
(x_21, x_22, ..., x_2n_2) of size n_1 and n_2, respectively, and suppose
that the variable of interest, x, is normally distributed with means
μ_1, μ_2 and precisions h_1, h_2 in the two populations.

The  conditional  sampling  distribution  is  , 


R-  {    "£v  ,'Lt   I  ^v>  V\^,  M.,  lvz,,^,>-J^  = 


and the maximum likelihood estimators are


Ar-  ^-^^iL-^c; 


^^1 


Thus,  using  the  one-sample  analysis  as  a  guide,  we  select 
a  multivariate  prior  which  is  a  product  of  two  normals  and  two  gammas: 

7  Hi 
Clearly, the posterior will be of the same form because we have
chosen a prior which is conjugate to the conditional sampling
distribution. One may immediately write the posterior parameters by
subscripting the one sample results of 59]:

\[ n_i'' = n_i + n_i' \]

\[ \nu_i'' = \nu_i + \nu_i' + 1 \]

\[ \nu_i = n_i - 1 , \qquad \nu_i' = n_i' - 1 \]


Similarly,  the  unconditional  sampling  distribution  is  merely 
the  product  of  two  one-sample  expressions: 

To  determine  prior  parameters,  we  translate  the  "no  difference"  idea 
of  the  null  hypothesis  into  the  condition 


\[ m_1' = m_2' = m' \qquad 77] \]

which constrains the internal extrema of 72]. Upon setting the partial
derivatives of ln P to zero and applying 77], one obtains the
following set of simultaneous equations:


With the aid of 75] and 77], these equations can be reduced to
the set

(n.>  y\;  rx",  V;  W .  -v  (^Ax n :j  <  V/  ^  ^^ 
vw  - 

( A.  vi;  vx;;  v;  ^     V    C  ^i ^r  f^r  v; ) 


o  -  .^.  ^  [>(^V  -^  (  ^ll  ^ «- i^^  -  ^ 


v; 


TO 


-i»/'o,:  i 


As  in  the  one  sample  case,  our  particular  choice  of  a  prior  causes  the 
prior  and  posterior  variance  to  be  equal  for  each  population. 

The set 79] can be reduced to three equations by the simple
artifice of substituting for v' in 79b,c]:


o  -- 


?)Oa,Wl 
This is quite a simplification. But we can boil 80a,b] down a bit
more  by  making  the  following  definitions: 


^;- 


Equations  80b]  can  now  be  rewritten  as 


Q    z.       A,-  ^^^--m'^" 


C.-v  d^^Xr^-^-^')^ 


which  yield  the  expressions 


%ll 


Upon substituting 82] into 80a] and eliminating m' from 82], we
obtain two simultaneous equations in n_1' and n_2'.


^.-  V-,dA.       r^  V  ^d.A, 


n.     •      >    >  Wt    2- 


m^- vn, 


A':- a:  ^^-.^^i 
Equations 83] exhibit a most convenient structure, namely,


yn.. 


^H  a^"] 


We  see  no  way  to  simplify  these  equations  further.   Nonetheless, 
it  appears  that  a  numerical  solution  would  be  quite  straightforward. 
One merely plots two curves in the n_1'-n_2' plane, one for which
u,  +  u„  =  0  and  one  for  which   \r,  +  Vj^  -  ^^n.    Their  intersection 
defines the values of n_1' and n_2' for an internal maximum (cf. Figure
3).


Figure 3:   Sketch of a Numerical Method to Solve
            for Prior Parameters when P has
            an Internal Maximum


Now  that  the  prior  parameters  have  been  determined,  all  that 
remain  to  be  computed  are  the  desired  posterior  properties.  Although 
the  variances  in  the  two  populations  are  interesting,  it  seems  to 
us  that  the  researcher  is  invariably  keen  to  know  certain  statistics 
of  the  difference  between  the  means, 
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In  particular,  we  expect  him  to  be  interested  in  the  posterior  mean 
and  variance  of  'i   --    i    and  i    respectively  -  and   R- [  j  >  c< } 

Of course, the calculation of each of these quantities begins
with the joint distribution on μ_1, μ_2, h_1, h_2. Examination
of 74] indicates that the joint is factorable into two parts, each
of which resembles the joint in the one-sample case. Thus, the
marginal on μ_1, μ_2 is a product of two non-central t-distributions:


=>,  V,  -ij^  ^x 


b(o3 


It  is  a  straightforward  matter  to  compute  2-  ov\<i  i   ,  so  we  shall 
merely  state  the  results; 

i"=    "^5    <-  .    vr    ^"  'e>~^A^-^l 


The  calculation  of  the  tail  probability  is  much  more  challenging. 
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After  the  transformation 


\IZ 


equation     88]    becomes 

\  {  f  v  -\ 


where 


/3  =    td.C  [  ^\r  +  c,  -  vn^Uw  ,  /  v^X)''"-^ 


and all primes have been omitted to avoid clutter. After applying
formula 331.7a) in Gröbner and Hofreiter and simplifying, we obtain
the final result:


where 

i;^-  i.  -  er-S  ,  S  -  i  or  O 


c  -^  ( ^'V" 


Vj   BCfiiB(^.i) 


and      "L^,^  -l}■^^,   Y\>,  *^^     are  restricted  to  integral  values.   Evidently, 
90]  must  be  evaluated  numerically. 
This completes the development of the last test in the program
announced  above.   We  apologize  for  failing  to  include  a  numerical 
example;  we  simply  lacked  the  patience  to  compute  the  prior  parameters. 

In  the  following  section  we  review  the  justification  for  the 
approach  we  have  taken  and  comment  on  the  philosophical  implications 
of  using  the  data  to  define  a  prior. 

7.   Discussion

The  fundamental  tenet  of  this  analysis  is  that  hypothesis 
testing  is  essentially  an  attempt  to  estimate  the  strength  of 
an  effect.   Because  an  effect  is  usually  measured  in  terms  of  a 
difference  in  proportions  or  means  that  can  never  be  known  precisely, 
any  estimate  must  be  based  on  a  diffuse  hypothesis  about  the  parameters 
of  interest.   Such  a  diffuse  hypothesis  is  merely  a  probability  density 
function,  a  quantification  of  our  belief  about  where  the  true  value 
of  the  parameter  lies. 

Clearly,  the  conditional  law  of  probability  provides  a  natural 
schema  to  make  inferences  about  parameters  if  one  can  unambiguously 
define  an  acceptable  prior  hypothesis.   The  problem  confronted  in 
this  paper  is  development  and  application  of  a  rule  for  choosing 
a  prior  hypothesis  which  is  both  unambiguous  and  conservative.   The 
rule  can  be  stated  very  simply:   choose  the  prior  hypothesis  so  that 
1)  its  expectation  is  the  conventional  null  value  and  2)  it  has  maximum 
probability  of  producing  the  observed  data.  For  mathematical  convenience, 
the  form  of  the  prior  is  tailored  to  combine  simply  with  the  conditional 
sampling  distribution.   If  a  conjugate  prior  seems  too  constraining,  one 
can  always  select  a  more  satisfying  function;  however,  the  above  rule 
for  defining  the  exact  form  of  the  prior  still  applies. 
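The rule can be illustrated for a single binomial proportion. The sketch below assumes a beta prior with parameters r' and n' - r', a null value of 1/2 (so that n' = 2r'), and integer prior parameters searched over a finite grid; these restrictions, the function names, and the grid limit are our simplifications. The quantity maximized is the beta-binomial probability of the data with the binomial coefficient dropped, since that factor does not depend on the prior.

```python
from math import lgamma

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(r, n, r_prior, n_prior):
    """Log of B(r + r', n - r + n' - r') / B(r', n' - r').

    This is the unconditional (beta-binomial) probability of r successes
    in n trials, up to the binomial coefficient.
    """
    return (
        log_beta(r + r_prior, (n - r) + (n_prior - r_prior))
        - log_beta(r_prior, n_prior - r_prior)
    )

def most_hostile_prior(r, n, max_r_prior=50):
    """Integer prior (r', n') with mean 1/2 that maximizes the marginal."""
    best = max(
        range(1, max_r_prior + 1),
        key=lambda a: log_marginal(r, n, a, 2 * a),
    )
    return best, 2 * best

# 15 successes in 20 trials, null proportion 1/2.
print(most_hostile_prior(15, 20))
```

For 15 successes in 20 trials the search settles on r' = 2, n' = 4. When the data sit exactly on the null value the marginal keeps growing with n', so the search simply returns the largest prior on the grid.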
At a more abstract level, the choice of a prior has been posed
as  a  problem  in  constrained  maximization.   Although  we  have  imposed 
only  one  constraint,  future  research  may  suggest  others.   For 
example,  one  could  formalize  the  fortuitous  result  of  the  difference 
of  means  tests  by  requiring  that  the  prior  and  posterior  variance  be 
the  same.   Of  course,  the  greater  the  number  of  constraints,  the 
lower  the  probability  that  the  chosen  prior  produced  the  data.    So 
we  prefer  to  keep  the  constraints  to  a  minimum. 

The idea of using the data to define a prior is novel if not a
bit shocking. Indeed, some may assert that an hypothesis is no
longer a prior hypothesis if its parameters are based on the data
about which we propose to make a posteriori inferences. But note that
we do specify the rule in advance of observation much as a careful
researcher spells out his operational definitions before subjecting
the data to analysis. Moreover, the rule is constructed to produce
a null (prior) which is as hostile to the research hypothesis as possible,
in light of the observed data. We shall let Clutton-Brock (1965, p. 27)
have the closing word:

In  my  view,  we  are  entitled  to  use  any  evidence, 
prior  or  posterior  to  form  an  estimate  of  the 
prior  frequency  distribution.   I  prefer  to  retain 
the  word  "prior"  in  my  description  of  the  estimate, 
because  I  feel  that  the  name  of  what  is  estimated 
should  not  be  altered  by  the  method  of  estimation. 